6.2 Constraint
$$
\begin{array}{c|cc}
\rightarrow & C & V \\ \hline
C & p_{cc} & p_{cv} \\
V & p_{vc} & p_{vv}
\end{array}
\qquad (6.15)
$$
where $p_{cc}$ means the probability that a consonant is followed by another consonant, and similarly for the other terms. The matrix is stochastic; that is, each row must sum to 1. If the entries within every column are identical (that is, if all the rows are the same), then there is no dependence on the preceding symbol, and we revert to a random, or zeroth-order Markov, process; the short sketch below illustrates the first-order case.
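A minimal sketch in Python of the first-order matrix (6.15); the numerical values of $p_{cc}$, etc., are illustrative assumptions, not data from the text. It checks the stochasticity condition and generates a sequence in which each symbol depends only on its predecessor:

```python
import random

# Minimal sketch of the first-order matrix (6.15); the numerical values
# are illustrative assumptions. Rows index the preceding symbol,
# columns the following one.
P = {
    "C": {"C": 0.1, "V": 0.9},  # p_cc, p_cv
    "V": {"C": 0.7, "V": 0.3},  # p_vc, p_vv
}

# The matrix is stochastic: every row must sum to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-12 for row in P.values())

def sample(n, state="C"):
    """Generate n symbols, each depending only on its predecessor."""
    out = []
    for _ in range(n):
        state = "C" if random.random() < P[state]["C"] else "V"
        out.append(state)
    return "".join(out)

print(sample(20))  # e.g. 'VCVVCVCVVCVCVVCVVCVC'
```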
Suppose now that observation reveals that the probability of C occurring after V preceded by C is
different from that of C occurring after V preceded by V, or even that the probability
of C occurring after VV preceded by C is different from that of C occurring after
VV preceded by V. These higher-order Markov processes can be recoded in strict
Markov form; thus, for the second-order process (dependency of the probabilities on
the two preceding symbols) “VVC” can be written as a transition from VV to VC,
and hence the matrix of transition probabilities becomes
$$
\begin{array}{c|cccc}
\rightarrow & CC & CV & VC & VV \\ \hline
CC & p_{ccc} & p_{ccv} & 0 & 0 \\
CV & 0 & 0 & p_{cvc} & p_{cvv} \\
VC & p_{vcc} & p_{vcv} & 0 & 0 \\
VV & 0 & 0 & p_{vvc} & p_{vvv}
\end{array}
\qquad (6.16)
$$
and so on for higher orders. Notice that some transitions necessarily have zero probability.¹²
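The recoding can be sketched in a few lines of Python; the second-order probabilities below are illustrative assumptions. A transition from pair state $ab$ to pair state $b'c$ is possible only when $b = b'$, which reproduces the pattern of zeros in (6.16):

```python
from itertools import product

# Sketch of the recoding behind (6.16): a second-order chain over {C, V}
# becomes a first-order chain over the pair states CC, CV, VC, VV.
# p2[(a, b)][c] is the probability of symbol c after the pair ab;
# the numerical values are illustrative assumptions.
p2 = {
    ("C", "C"): {"C": 0.05, "V": 0.95},
    ("C", "V"): {"C": 0.60, "V": 0.40},
    ("V", "C"): {"C": 0.15, "V": 0.85},
    ("V", "V"): {"C": 0.80, "V": 0.20},
}

states = [a + b for a, b in product("CV", repeat=2)]  # CC, CV, VC, VV

# Start from an all-zero matrix; only transitions ab -> bc whose middle
# symbols agree receive nonzero probability, as in (6.16).
T = {s: {t: 0.0 for t in states} for s in states}
for (a, b), dist in p2.items():
    for c, p in dist.items():
        T[a + b][b + c] = p

for s in states:
    print(s, T[s], "row sum:", sum(T[s].values()))  # each row sums to 1
```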
The reader may object that one rarely composes text letter by letter, but rather
word by word. Clearly, there are strong constraints governing the succession of words
in a text. The frequencies of these successions can be obtained by counting successive word pairs in very long texts and are then used to construct the transition matrix,
which is, of course, gigantic even for a first-order process. We remark that a book
ending with “…in the solid state is greatly aided by this new tool” is more likely to
begin with “Rocket motor design received a considerable boost when …” than one
ending “I became submerged in my thoughts which sparkled with a cold light”.¹³
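The counting procedure itself is simple, as the following Python sketch shows; the toy text here is an illustrative assumption, whereas in practice one needs a very long corpus, and the resulting matrix is gigantic:

```python
from collections import Counter, defaultdict

# Sketch of estimating word-level transition probabilities by counting
# successive word pairs in a text; the toy text is an illustrative
# assumption standing in for a very long corpus.
text = "the cat sat on the mat and the cat slept"
words = text.split()

pair_counts = Counter(zip(words, words[1:]))  # counts of (w1, w2) successions
row_totals = Counter(words[:-1])              # occurrences of w1 as a predecessor

P = defaultdict(dict)
for (w1, w2), n in pair_counts.items():
    P[w1][w2] = n / row_totals[w1]  # relative-frequency estimate of p(w2 | w1)

print(dict(P["the"]))  # {'cat': 0.666..., 'mat': 0.333...}
```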
We note here that one may naturally attempt to model DNA or protein sequences as Markov processes, as will be discussed in Part III. Markov chains as such will be discussed more fully in Chap. 11.
The notion of constraint applies whenever a set “is smaller than it might be”. The
classic example is that of road traffic lights, which display various combinations of
red, amber, and green, each of which may be on or off. Although $2^3 = 8$ combinations are theoretically possible, in most countries only certain combinations are
used, typically only four out of the eight. Constraints are ubiquitous in the universe
¹² See also Sect. 11.2.
¹³ Good (1969) has shown that ordinary language cannot be represented even by a Markov process of infinite order. So-called hidden Markov models (HMM), discussed elsewhere in this book, offer in principle a more powerful representational tool.